ABSTRACT:

A single gram of soil can contain thousands of unique 
bacterial species, of which only a small fraction is 
regularly cultured in the laboratory. Although the 
fermentation of cultured microorganisms has provided 
access to numerous bioactive secondary metabolites, with 
these same methods it is not possible to characterize the 
natural products encoded by the uncultured majority. 
The heterologous expression of biosynthetic gene clusters 
cloned from DNA extracted directly from environmental 
samples (eDNA) has the potential to provide access to the 
chemical diversity encoded in the genomes of uncultured 
bacteria. One of the challenges facing this approach has 
been that many natural product biosynthetic gene clusters 
are too large to be readily captured on a single fragment 
of cloned eDNA. The reassembly of large eDNA-derived 
natural product gene clusters from collections of smaller 
overlapping clones represents one potential solution to 
this problem. Unfortunately, traditional methods for the 
assembly of large DNA sequences from multiple 
overlapping clones can be technically challenging. Here 
we present a general experimental framework that
permits the recovery of large natural product biosynthetic 
gene clusters on overlapping soil-derived eDNA cosmid 
clones and the reassembly of these large gene clusters 
using transformation-associated recombination (TAR) in 
Saccharomyces cerevisiae. The development of practical 
methods for the rapid assembly of biosynthetic gene 
clusters from collections of overlapping eDNA clones is 
an important step toward being able to functionally 
study larger natural product gene clusters from 
uncultured bacteria. © 2010 Wiley Periodicals, Inc. 
Biopolymers 93: 833-844, 2010. 
INTRODUCTION

Cultured soil bacteria have been a productive source of
both biologically active and structurally diverse natural
products. 1,2 Molecular phylogenetic analyses of soil
microbiomes now indicate that a single gram of soil can
contain thousands of unique bacterial species, only a
small fraction of which is regularly cultured in
the laboratory. 3-6 Uncultured bacteria represent one of the
largest pools of genetic diversity that has not been examined 
for the production of natural products. Culture-independent 
analysis of microbial communities using DNA extracted 
directly from environmental samples, which is commonly 
defined as metagenomics, has the potential to provide access 
to the biosynthetic capacity of uncultured bacteria. 7 

All of the genes required for the biosynthesis of a natural 
product, including genes that encode biosynthetic, regulatory, 
and self-immunity enzymes, are typically clustered on bacterial 
chromosomes. Natural product gene clusters can range in size
from a few kilobases to over 100 kilobases. The heterologous 
expression of natural product biosynthetic gene clusters cap- 
tured on individual eDNA clones has begun to provide access 
to some of the natural products encoded in the genomes of 
uncultured bacteria (see Figure 1). However, a major limita- 
tion of this strategy has been the inability to routinely 
construct very large eDNA libraries with inserts big enough 
to capture large biosynthetic pathways on individual clones. 
Figure 1 shows a collection of metabolites that have been iso- 
lated from the culture broths of soil-derived eDNA clones. In 
each case, a single cosmid/fosmid eDNA clone confers the pro- 
duction of the metabolites to a heterologous host. Successful 
functional metagenomic natural product discovery studies 
carried out on marine samples and other microbiomes have 
also largely been restricted to single clones. 13,21

While the construction of 30-40 kb insert cosmid libraries 
from environmental samples is now routine, the construction 
of larger insert libraries that can be used to capture large natu- 
ral product gene clusters has been challenging. Bacterial artifi- 
cial chromosome (BAC)-derived libraries are capable of cap- 
turing larger inserts but generally yield metagenomic libraries 
that are two to three orders of magnitude smaller than those 
constructed using cosmid-based cloning strategies. 22 Theoreti- 
cally, all gene clusters that are too large to be captured on a sin- 
gle cosmid-sized clone can be reassembled from collections of 
overlapping eDNA cosmid clones (Figure 1b). Existing gene
cluster assembly strategies depend on either unique restriction
sites or λ-mediated recombination to reassemble large DNA
fragments. Both of these strategies are technically challenging 
when working with very large DNA fragments or with sequences
that span more than two overlapping clones. 11-17 Transformation-associated
recombination (TAR) in Saccharomyces cerevisiae
relies on homologous recombination to selectively capture
a known sequence from a mixture of genomic DNA. 18,19
In TAR cloning protocols, genomic DNA and a "capture" vec- 
tor with short homology arms corresponding to sequences 
flanking the region of interest are cotransformed into S. cerevi- 
siae. The capture vector arms and homologous target DNA 
undergo recombination to yield a stable plasmid containing 
the targeted genomic region. TAR was originally developed to 
facilitate cloning large genomic fragments without having to 
construct and screen genomic DNA libraries. Recent studies 
extended the scope of this methodology by showing that it 
could be used to assemble 25 cotransformed overlapping DNA 
fragments into a complete 592-kb synthetic genome and that 
multiple PCR products could be assembled into small bio- 
chemical pathways. 20-22 These studies led us to believe that 
TAR could also be used to assemble large natural product gene 
clusters from multiple overlapping eDNA clones. 



In this report, we show that TAR in S. cerevisiae can be 
used to rapidly reassemble large natural product biosynthetic 
gene clusters from overlapping eDNA cosmid clones. The 
rich microbial diversity present in soils makes them attrac- 
tive, but challenging, starting points for the culture-inde- 
pendent discovery of new natural product biosynthetic gene 
clusters. Much of the difficulty in working with soil-derived 
eDNA libraries stems from their inherent complexity, which 
necessitates the construction of very large clone libraries to 
ensure that large biosynthetic pathways can be recovered in 
their entirety. Using two of the largest soil eDNA cosmid 
libraries reported to date as examples, we have also empiri- 
cally investigated the minimum size eDNA libraries will likely 
need to be to recover complete large natural product gene 
clusters on overlapping cosmid clones. Taken together, these 
studies provide an experimental framework for gaining 
access to large, intact natural product biosynthetic gene 
clusters from soil microbiomes. 

MATERIALS AND METHODS 

Library Construction and Formatting 

Top soil collected in Utah and California was used to construct cos- 
mid-based eDNA libraries following methods previously described. 23 
Briefly, the soil was incubated at 70°C in lysis buffer [2% sodium dodecyl
sulfate (w/v), 100 mM Tris-HCl, 100 mM ethylenediaminetetraacetic
acid (EDTA), 1.5 M NaCl, 1% cetyl trimethylammonium bromide
(w/v)] for 2 h. Large particulates were then removed by centrifugation
(4,000 × g, 30 min). DNA was precipitated from the resulting
supernatant with the addition of 0.6 volumes of isopropyl alcohol, pelleted
by centrifugation (4,000 × g, 30 min), washed with 70% ethanol
and resuspended in a minimum volume of TE (10 mM Tris, 1 mM
EDTA, pH 8). High molecular weight DNA that was purified from the
crude extract by gel electrophoresis (1% agarose, 0.5× Tris/Borate/
EDTA, 16 h, 20 V) was blunt-ended (End-It, Epicentre Biotechnologies),
ligated into precut pWEB or pWEB-TNC (Epicentre Biotechnologies),
packaged into lambda phage, and transduced into Escherichia
coli (EC100, Epicentre Biotechnologies). Individual library aliquots
equivalent to ~4000-5000 colony forming units were either plated on 
agar plates or inoculated into 5 ml of liquid LB and then allowed to 
incubate overnight at 37°C with the appropriate selection. Once colo- 
nies formed, the plate-grown aliquots were resuspended in 5 ml of LB. 
Matching glycerol stocks (15% glycerol) and DNA miniprep pairs 
were created from each unique library aliquot. The minipreps were 
arrayed in 8 × 8 grids corresponding to 250,000-320,000 total cosmid
clones and DNA from the rows and columns of each grid was pooled.
To facilitate library screening, pooled rows and columns were further
combined to yield master aliquots, each representing a single 8 × 8
grid of minipreps. Each unique E. coli transduction yielded three 
master aliquots (~750,000 clones) of the Utah library and one 
master aliquot (~320,000 clones) of the California library. In 
total, the Utah soil library contains ~10 million unique cosmid 
clones and the California soil library contains ~15 million 
unique cosmid clones. 
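
The row-and-column pooling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not software used in the study: the function names and the example PCR results are hypothetical, and only the 8 × 8 grid size and the 4000-5000 clones per aliquot are taken from the text.

```python
GRID = 8  # the text describes 8 x 8 grids of miniprep aliquots


def clones_per_grid(clones_per_aliquot, grid=GRID):
    """Total clones represented by one grid (one master aliquot)."""
    return grid * grid * clones_per_aliquot


def candidate_wells(positive_rows, positive_cols):
    """Grid positions implicated by PCR-positive row and column pools.

    Row and column indices are hypothetical 0-based labels.
    """
    return [(r, c) for r in sorted(positive_rows) for c in sorted(positive_cols)]


# Range of clones covered by one grid of 4000-5000-membered aliquots
print(clones_per_grid(4000), clones_per_grid(5000))

# A single positive row pool and column pool pinpoint one aliquot
print(candidate_wells({2}, {5}))
```

With this layout a positive aliquot is located by screening 16 pools (8 rows and 8 columns) rather than 64 individual minipreps; when several rows and columns are positive, every row/column intersection remains a candidate and has to be checked individually.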






Cloning Large Natural Product Gene Clusters From the Environment 




FIGURE 1 (A) Natural products that have been isolated from individual soil-derived eDNA
clones are shown. These include terragines A-E (1-5), 8 norcardamine (6), 8 turbomycin A (7)
and B (8), 9 a C3-isocyanide functionalized indole derivative (9), 10 erdacin (10), 11 aliphatic dienic
alcohol isomers (11, 12), 12 indirubin (13), 13,14 indigo (14), 13,14 deoxyviolacein (15), 15 violacein
(16), 15 palmitoylputrescine (17), 16 long chain enol esters (18), 17 long chain eneamides (19), 17 and
various long chain N-acyl amino acids (20-23). 18-20 (B) TAR-based gene cluster reassembly strategies
can provide access to larger natural product gene clusters captured on overlapping eDNA
clones.
Library Size Analysis

DNA from each unique E. coli transduction reaction was used as a
template in PCR reactions with degenerate primers designed to
amplify β-ketoacyl synthase gene sequences (dp:KSβ, 5'-TTCGGSGGNTTCCAGWSNGCSATG-3'
and dp:ACP, 5'-TCSAKSAGSGCSANSGASTCGTANCC-3'). 11,38 Each 25-μl PCR reaction
contained 50 ng eDNA template, 2.5 μM of each primer, 2 mM dNTPs,
1× ThermoPol Reaction Buffer (New England Biolabs), 0.5 U Taq
DNA polymerase (New England Biolabs), and 5% dimethyl sulfoxide.
Reactions were cycled using the following touchdown protocol:
initial denaturation (95°C, 2 min), then eight touchdown cycles
[95°C, 45 s; 65°C (dt -1°C/cycle), 1 min; 72°C, 2 min], 35 standard
cycles (95°C, 45 s; 58°C, 1 min; 72°C, 2 min) and a final extension
step (72°C, 2 min). 11,38 Amplicons of the correct predicted size
(~1.5 kb) were identified by gel electrophoresis, gel purified, and
directly sequenced. In total, DNA from seven unique E. coli transductions
of the Utah library and 20 unique E. coli transductions of
the California library was examined.
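
Because the screening primers above are written with IUPAC ambiguity codes, each primer is really a pool of oligonucleotides. The following minimal sketch counts and enumerates the members of such a pool. It assumes that the two primer sequences are read by joining the fragments printed across the line break in the text; the function and variable names are illustrative only.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer):
    """Number of distinct sequences encoded by a degenerate primer."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count


def expand(primer):
    """Enumerate every unambiguous sequence in the degenerate pool."""
    pools = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*pools)]


# Primer sequences as given in the text (joined across the line break)
KS_PRIMER = "TTCGGSGGNTTCCAGWSNGCSATG"
ACP_PRIMER = "TCSAKSAGSGCSANSGASTCGTANCC"

print(degeneracy(KS_PRIMER), degeneracy(ACP_PRIMER))
print(expand("TSW"))  # small example: S = C/G, W = A/T
```

The degeneracy is simply the product of the number of bases allowed at each position, so a handful of S and N positions quickly yields pools of hundreds or thousands of oligonucleotides.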

Identification of Gene Clusters of Interest 

PCR reactions with degenerate primers designed to amplify
β-ketoacyl synthase gene sequences were used to detect Type II polyketide
synthase (PKS) sequences. 11,38 Degenerate primers designed
to detect flavin-dependent halogenases (TyrohalF3: 5'-CGGCTGGTTCTGGTACATCCC-3',
TyrohalR2: 5'-GAACTCGTAGAASACSCCGTACTC-3') were used to identify the
nonribosomal peptide synthetase (NRPS) gene cluster. The FRI gene cluster was identified
using primers that recognize conserved sequences in acyl-CoA
ligases found in lipopeptide antibiotic gene clusters (DpFrEFWD1:
5'-TSMTSCAGTACACSTCSGG-3' and DpFrEREV1: 5'-WDGTCGTASGCGAAGTCSG-3').
Type II PKS sequences were amplified using
the same PCR conditions outlined for the library size analysis. Flavin-dependent
halogenases were amplified using the following PCR
conditions: each 20-μl reaction contained primer added to a final
concentration of 2.5 μM, 0.5 μl of eDNA template (~100 ng),
1× FailSafe Buffer G (Epicentre Biotechnologies), and 1 U of Taq
DNA polymerase. Reactions were cycled using the following touchdown
protocol: initial denaturation (95°C, 2 min); 9 touchdown
cycles [95°C, 30 s; 70°C (dt -1°C/cycle), 30 s; 72°C, 30 s], 30 standard
cycles (95°C, 30 s; 60°C, 30 s; 72°C, 30 s), and a final extension
step (72°C, 5 min). The acyl-CoA ligase homologues were identified
using the following reaction conditions: each 25-μl reaction contained
primer added to a final concentration of 2.5 μM, 0.5 μl of eDNA
template (~100 ng), 1× ThermoPol Buffer, 2 mM dNTPs, and
0.5 U of Taq DNA polymerase. Reactions were cycled using the following
touchdown protocol: initial denaturation (95°C, 2 min);
6 touchdown cycles [95°C, 30 s; 65°C (dt -1°C/cycle), 30 s; 72°C,
30 s], 30 standard cycles (95°C, 30 s; 58°C, 30 s; 72°C, 30 s), and a
final extension step (72°C, 2 min). Amplicons of the correct predicted
size were gel purified and directly sequenced.
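
The three touchdown programs described in this section differ only in their starting annealing temperature, number of touchdown cycles, and final annealing temperature. The sketch below writes out the per-cycle annealing temperatures. It assumes that the first touchdown cycle anneals at the stated starting temperature and that each later touchdown cycle is 1°C lower; the function name and the dictionary labels are illustrative, not taken from the source.

```python
def touchdown_schedule(start_temp, touchdown_cycles, step, final_temp, standard_cycles):
    """Annealing temperature (deg C) for every cycle of a touchdown PCR program."""
    temps = [start_temp - step * i for i in range(touchdown_cycles)]
    temps += [final_temp] * standard_cycles
    return temps


# (start, touchdown cycles, decrement per cycle, standard annealing temp, standard cycles)
PROTOCOLS = {
    "Type II PKS": (65, 8, 1, 58, 35),
    "halogenase": (70, 9, 1, 60, 30),
    "acyl-CoA ligase": (65, 6, 1, 58, 30),
}

for name, params in PROTOCOLS.items():
    schedule = touchdown_schedule(*params)
    print(name, len(schedule), "cycles; touchdown phase:", schedule[:params[1]])
```

Under this reading, the Type II PKS program steps down from 65°C to 58°C over its eight touchdown cycles and then continues at 58°C, so the touchdown phase ends exactly at the standard annealing temperature.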

General Procedure for Clone Recovery 

Individual clones were recovered from a 4000-5000-membered sublibrary
by plating a 10⁻⁵ or 10⁻⁶ dilution of the corresponding glycerol
stock into 96-well microtiter plates and screening the diluted
cultures by whole-cell PCR with primers designed to recognize
amplicons detected in the initial screen. PCR-positive wells were
then either subjected to a second round of dilution plating or plated
directly on LB agar with ampicillin (50 μg ml⁻¹) to yield distinct
colonies that were screened by whole-cell PCR to identify individual
clones of interest. Each recovered cosmid was end-sequenced using
vector-specific (pWEB, pWEB-TNC) universal primers [M13(-40)
and the T7 promoter]. All clones were fully sequenced using 454
GS FLX pyrosequencing, assembled using Newbler (Roche), and
annotated using GeneMark and BLASTX. 39-41 Gene cluster images
were generated using MacVector. The amino acid substrate specificity
for each adenylation domain found in the cryptic NRPS gene
cluster was predicted using NRPSpredictor. 42

pTARa Vector Construction 

The yeast ARSH4 (autonomous replicating sequence), CEN6 (plasmid
maintenance element), and URA3 markers were obtained from
pLLX13 by digestion with EcoRI and HindIII. 23 After gel purification,
the fragment was ligated into similarly digested pCC1-BAC (Epicentre
Biotechnologies). The resulting vector was digested with HpaI and
ligated to a DraI fragment from pOJ436 containing an origin of transfer
(OriT), integrase and apramycin resistance gene. 43 Transformation
into EPI300 E. coli (Epicentre Biotechnologies) and selection on chloramphenicol
(12.5 μg ml⁻¹) and apramycin (50 μg ml⁻¹) yielded the
capture vector pTARa (TAR-ready BAC with the Streptomyces attP
integration system, GenBank accession number: GQ452294).

TAR Cloning 

TAR cloning was initially developed to selectively isolate regions of 
genomes without the need to construct and screen a genomic 
library. 23,24,32,33,44 The procedures outlined below describe our adaptation
of these methods for the isolation of sequenced natural product
gene clusters and the assembly of large natural product biosynthetic
gene clusters captured on multiple overlapping eDNA clones.

Pathway-Specific Capture Vector Construction 

The cycloheximide counter selection cassette (CYH2/bla) was PCR
amplified using pLLX8 as a template following reported protocols. 23
The cassette was amplified using primers pLLX8/fw/:
5'-TTTTCTAGAACGCGTTTAATTAAAATCTAAAGTATATATGAGTAAAC-3' and
pLLX8/rv/: 5'-CCCTCTAGAGTTAACGTTTAAACAAAAAACGGTGAAAATGGGTGATAG-3'.
Each 50-μl reaction contained 1× FailSafe
Buffer B (Epicentre Biotechnologies), 2.5 μM of each primer, 100 ng
of pLLX8 template, and 1 U of Taq DNA polymerase. Reactions were
cycled using the following protocol: initial denaturation (95°C,
2 min), 35 standard cycles (95°C, 30 s; 65°C, 30 s; 72°C, 3 min), and a
final extension step (72°C, 7 min). The 2.95-kb PCR product was gel
purified prior to capture vector assembly (MinElute™, Qiagen).
eDNA clone assembly homology arms were PCR amplified in 25-μl
reactions containing 100 ng of template cosmid, 2.5 μM of each
primer, 1× FailSafe Buffer D (Epicentre Biotechnologies), and 0.5 U
Taq DNA polymerase. Reactions were cycled using the following protocol:
initial denaturation (95°C, 2 min), 35 standard cycles (95°C,
1 min; 60°C, 1 min; 72°C, 1 min), and a final extension step (72°C,
5 min). PCR primers for homology arms were designed to contain
40 bp of homology to the pTARa vector and 40 bp of homology to
the counter selection cassette. 23 These homology regions were incorporated
to allow pathway-specific capture vector construction using
recombination in S. cerevisiae. Upstream homology arm amplification
primers contained a sense primer extension,
5'-ATATTACCCTGTTATCCCTAGCGTAACTATCGATCTCGAG-3', and an antisense
primer extension, 5'-CATATATACTTTAGATTTTAATTAAACGCGTTCTAGAAAA-3',
which add 40 bp of homology to pTARa and
the counter selection cassette, respectively. The downstream targeting
sequence sense primer extension is
5'-CATTTTCACCGTTTTTTGTTTAAACGTTAACTCTAGAGGG-3', which provides homology to the
counter selection cassette, and the antisense primer extension is
5'-TAACAGGGTAATATAGAGATCTGGTACCCTGCAGGAGCTC-3', which
provides homology to pTARa. Each primer pair was designed to yield
a 600- to 900-bp amplicon that acts as a homology arm in a pathway-specific
capture vector used for a TAR reassembly reaction. 23 Cosmids
X16 and V48 were used as templates to generate upstream and downstream
homology arms for the PKS gene cluster. Cosmids ZA41 and
J2 were used as templates to generate upstream and downstream
homology arms for the NRPS gene cluster. Cosmids 1679 and 201
were used as templates to generate upstream and downstream homology
arms for the FRI gene cluster. About 300 ng of purified Citrobacter
koseri genomic DNA (MasterPure™ Complete DNA Purification Kit,
Epicentre Biotechnologies) was used as a template to generate
upstream and downstream homology arms for the colibactin gene
cluster (GenBank accession number: AM229678). Each PCR amplified
component was gel purified prior to its use in the assembly of a pathway-specific
capture vector.
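
The 40-bp primer extensions listed above can be sanity-checked computationally. The sketch below is a minimal check under one assumption: each sequence is read by joining the fragments printed across line breaks in the text (each pair of adjacent string literals mirrors one such break). It confirms that every extension is 40 nt long and that the two cassette-facing extensions are the reverse complements of the first 40 nt of the cassette amplification primers, which is what lets the arm and cassette amplicons recombine end to end in yeast. Variable names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Counter selection cassette amplification primers (5'->3')
CASSETTE_FWD = "TTTTCT" "AGAACGCGTTTAATTAAAATCTAAAGTATATATGAGTAAAC"
CASSETTE_REV = "CCCTCTAGAGTTAACGTTTAAACAAAAAACGGTG" "AAAATGGGTGATAG"

# 40-nt extensions added to the homology arm primers (5'->3')
UP_SENSE = "ATATTAC" "CCTGTTATCCCTAGCGTAACTATCGATCTCGAG"      # homology to pTARa
UP_ANTISENSE = "CATATATACTTTAGATTTTAATTAAAC" "GCGTTCTAGAAAA"  # homology to cassette
DOWN_SENSE = "CATTTTCACCGTTTTTTGT" "TTAAACGTTAACTCTAGAGGG"    # homology to cassette
DOWN_ANTISENSE = "TA" "ACAGGGTAATATAGAGATCTGGTACCCTGCAGGAGCTC"  # homology to pTARa

for name, ext in [("UP_SENSE", UP_SENSE), ("UP_ANTISENSE", UP_ANTISENSE),
                  ("DOWN_SENSE", DOWN_SENSE), ("DOWN_ANTISENSE", DOWN_ANTISENSE)]:
    print(name, len(ext))

# The cassette-facing extensions should pair with the cassette primer ends
print(UP_ANTISENSE == revcomp(CASSETTE_FWD[:40]))
print(DOWN_SENSE == revcomp(CASSETTE_REV[:40]))
```

A check of this kind is a cheap way to catch transcription errors before ordering long primers, since a single missing base in a homology extension would not be obvious by eye.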

For the assembly of a pathway-specific capture vector, 200 ng of
pTARa was linearized with NheI and added to 200 μg of heat-denatured
single-stranded carrier DNA (heated to 95°C for 10 min then
kept on ice), 600 ng of CYH2/bla counter selection cassette amplicon 23
and 200 ng of an upstream and downstream homology arm
amplicon pair prepared as described above. All components were
added to lithium acetate prepared chemically competent CRY1-2
(uracil deficient, ura⁻) yeast, plated on synthetic complete (SC) uracil
dropout agar (Invitrogen) and incubated at 30°C. 45 Colonies typically
began to appear within 24-48 h. Assembled capture vectors were isolated
in bulk by resuspending yeast colonies from a 100 mm SC dropout
agar plate in 5 ml of 1× phosphate buffered saline. Plasmid
DNA was isolated from 1 ml of resuspended cells (ChargeSwitch™
Yeast Plasmid Isolation Kit, Invitrogen). About 100 ng of the purified
DNA was transformed into electrocompetent EPI300 E. coli and
plated on LB agar containing ampicillin (100 μg ml⁻¹), chloramphenicol
(12.5 μg ml⁻¹), and apramycin (50 μg ml⁻¹) to yield a
pathway-specific capture vector containing a counter-selection cassette.

TAR Cloning and Pathway Assembly 

Direct TAR cloning of the colibactin gene cluster from genomic
DNA was carried out using reported protocols. 24,44 For eDNA pathway
assembly, each cosmid to be used in an assembly reaction was
initially linearized by digestion with DraI and the capture vector was
linearized by digestion with PmeI. About 200 ng of each linearized
cosmid and an equimolar amount (~100 ng) of a linearized pathway-specific
capture vector were added to 200 μl of S. cerevisiae
spheroplasts prepared as previously reported. 44 The transformed
spheroplasts were added to 7 ml of top agar equilibrated to 50°C
[1 M sorbitol, 1.92 g l⁻¹ SC uracil dropout supplement (Invitrogen),
6.7 g l⁻¹ yeast nitrogen base (Invitrogen), 2% glucose, 2.5% agar].
The top agar containing transformed spheroplasts was overlaid onto
SC dropout agar containing 2.5 μg ml⁻¹ cycloheximide. The plates
were incubated at 30°C and spheroplast growth was typically seen
within 72 hours. The resulting recombinants were patched onto SC
uracil dropout agar with cycloheximide (2.5 μg ml⁻¹) for overnight
growth at 30°C.

For initial PCR detection of reassembled pathways, a small portion
of each yeast patch was resuspended in 10 μl of 20 mM NaOH
and heated at 95°C for 10 min. Then 1.5 μl of the cell lysate was used
as a template in a 50-μl multiplex PCR reaction following the manufacturer's
directions (Multiplex PCR Kit, Q solution™, Qiagen).
The primer sets used in this analysis were designed to recognize
unique regions from each overlapping cosmid clone that was used
in an assembly reaction. In the colibactin TAR experiment, PCR
primer pairs were designed to detect the previously reported boundaries
of the biosynthetic gene cluster. 46

Analysis of TAR Recombined Clones 

Yeast recombinants that produced PCR amplicons of the correct size for
all portions of a pathway were grown overnight (30°C, 225 rpm) in
2 ml of SC uracil dropout media (or on SC uracil dropout agar) with
2.5 μg ml⁻¹ cycloheximide and TAR assembled pathways were isolated
from these cultures (ChargeSwitch™, Invitrogen). Five microliters of
ChargeSwitch™ prepared DNA (1/10 elution volume) was transformed
into electrocompetent EPI300 E. coli, which were outgrown at
30°C for 2 h (225 rpm) and then plated on LB agar with 12.5 μg ml⁻¹
chloramphenicol. Whole-cell PCR was used to identify E. coli colonies
containing correctly reassembled gene clusters. DNA was then isolated
from 5 ml cultures of PCR-positive E. coli transformants using alkaline
lysis and isopropanol precipitation (CopyControl™ pCC1-BAC Induction
Protocol, Epicentre Biotechnologies). E. coli transformants containing
the colibactin gene cluster were identified using eight sets of
previously reported PCR primers designed to detect different ORFs in
the pathway (data not shown). 46 Detailed restriction mapping was carried
out on each reassembled pathway using an enzyme (PKS, EcoRI;
NRPS, EcoRI; FRI, BglII; Colibactin, HindIII) that was predicted to
yield restriction fragments that could be easily resolved using agarose
gel electrophoresis (1% agarose, 0.5× Tris/Borate/EDTA, 30 V, overnight).
The lambda HindIII and 50-bp molecular weight markers were
obtained from New England Biolabs. The full sequence of each
reassembled gene cluster was deposited with GenBank under the following accession
numbers: NRPS: GQ475282, FRI: GQ475284, and PKS: GQ475283.
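
Restriction mapping of a reassembled construct relies on comparing observed bands with fragment sizes predicted from sequence. The following minimal sketch shows how such a prediction could be made for a circular construct. It is an assumption-laden illustration rather than the procedure used in the study: the function names and the toy sequence are hypothetical, and a real analysis would use the full assembled sequence. The recognition sites for EcoRI, HindIII, and BglII are the standard ones.

```python
SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BglII": "AGATCT"}


def site_positions(seq, site):
    """0-based start positions of a recognition site in a circular sequence."""
    seq = seq.upper()
    wrapped = seq + seq[:len(site) - 1]  # lets a site span the origin
    return [i for i in range(len(seq)) if wrapped.startswith(site, i)]


def circular_fragments(seq, enzyme):
    """Predicted fragment lengths (bp) after complete digestion of a circle.

    All three sites are palindromic and each enzyme cuts at a fixed offset
    within its site, so site-to-site distances equal fragment lengths.
    """
    n = len(seq)
    pos = site_positions(seq, SITES[enzyme])
    if not pos:
        return [n]  # uncut circle
    frags = [(pos[(i + 1) % len(pos)] - p) % n or n for i, p in enumerate(pos)]
    return sorted(frags, reverse=True)


# Hypothetical 150-bp toy construct with two EcoRI sites
toy = "GAATTC" + "A" * 94 + "GAATTC" + "C" * 44
print(circular_fragments(toy, "EcoRI"))
print(circular_fragments(toy, "HindIII"))
```

Comparing such predicted sizes with the banding pattern on the gel gives a quick, sequence-independent check that no large deletion or rearrangement occurred during recombination in yeast.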

Conjugation and Preparation for Heterologous 
Expression 

Assembled pathways were transformed into S17-1 E. coli for conjugation
into Streptomyces using published protocols. 43 All three reassembled 
eDNA gene clusters were successfully conjugated and chromosomally 
integrated into a number of Streptomyces including Streptomyces lividans, 
Streptomyces albus, and Streptomyces toyocaensis. 47 

RESULTS AND DISCUSSION 

Library Size Analysis 

The genes responsible for the biosynthesis of a natural product 
are typically clustered on a bacterial chromosome, and therefore
theoretically can be cloned on a single continuous fragment
of eDNA. While the heterologous expression of biosynthetic
gene clusters captured on eDNA-derived clones has
begun to yield novel natural products, many natural product 
gene clusters are too large to be routinely captured on individ- 
ual eDNA cosmid clones (see Figure 1). With metagenomic 
libraries of sufficient size and sequence coverage, large gene 
clusters that cannot be captured on a single cosmid clone 
could be accessed by recovering collections of overlapping 
eDNA clones. Soil microbiomes are among the most geneti- 
cally diverse environments characterized to date and are there- 
fore attractive starting points for the discovery of natural 
products using a metagenomic approach. 3 However, because 
of this complexity, it is difficult to predict the size a soil-based 
eDNA library must be to permit the recovery of overlapping 
clones from a diverse collection of large natural product gene 
clusters. We set out to empirically investigate this problem 
using eDNA libraries constructed from two different soil sam- 
ples. For this study, DNA isolated from soil collected in Utah 
was used to construct a series of independent 750,000-membered
eDNA cosmid libraries (~10,000,000 clones in total)
and DNA isolated from a soil sample collected in California
was used to construct a series of independent 320,000-membered
eDNA cosmid libraries (~15,000,000 clones in total).

The reassembly of large natural product gene clusters from 
multiple overlapping eDNA fragments begins with the detec- 
tion of specific sequence(s) of interest located on two or more 
unique library clones. We therefore wanted to determine the 
point at which redundant sequences of interest began to regu- 
larly appear in unique eDNA library aliquots constructed from 
the same soil sample. Culture-based studies suggest that Type 
II (aromatic, iterative) PKS biosynthetic systems are common 
in bacteria and the PKS genes found in these systems are highly 
conserved. We therefore chose Type II PKS pathways as a 
model system for studying large (>30 kb) gene clusters present 
in soil-derived eDNA libraries. Both the California and Utah
libraries were screened for the presence of β-ketoacyl synthase
(KSβ) gene sequences using degenerate PCR primers designed
to recognize Type II PKS systems. 11,38 In total, 19 distinct KSβ
gene sequences were amplified from the Utah library and 73
distinct KSβ gene sequences were amplified from the California
library (see Figure 2). In the Utah library, redundant KSβ
sequences began to regularly appear once ~3 × 10⁶ clones had
been examined, while in the California library redundant KSβ
sequences began to regularly appear once ~2.25 × 10⁶ clones
had been examined. Additional screens using primers designed
to recognize other conserved natural product biosynthetic gene
sequences have shown similar results. In these studies, redundant
sequences begin to regularly appear once libraries exceed
1-3 × 10⁶ clones in size (data not shown). The libraries used
in our efforts to recover natural product gene clusters to be
used in TAR assembly experiments were therefore expanded
until they contained at least 1-1.5 × 10⁷ unique clones, which
corresponds to 5-10 times the number of clones needed to
identify the first redundant Type II PKS sequences. While even
an eDNA library of 1-1.5 × 10⁷ clones is unlikely to permit
the recovery of rare gene clusters, our analysis suggests that it 
will likely contain collections of clones encompassing complete 
PKS gene clusters and, by extension, overlapping clones from 
many other types of biosynthetic gene clusters found in the 
genomes of uncultured bacteria. 
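
The redundancy analysis described above amounts to tracking, aliquot by aliquot, when a previously seen sequence shows up again. A minimal sketch of that bookkeeping follows. The function name and the sequence identifiers are hypothetical, and the example data are invented for illustration; they are not the screening results reported here.

```python
def first_redundancy(aliquots):
    """Cumulative number of clones screened when a sequence first reappears.

    aliquots: list of (clones_in_aliquot, sequence_ids) in screening order.
    Returns None if no sequence is ever seen twice.
    """
    seen = set()
    screened = 0
    for clones, sequence_ids in aliquots:
        screened += clones
        current = set(sequence_ids)
        if seen & current:  # a sequence from an earlier aliquot reappeared
            return screened
        seen |= current
    return None


# Invented example: four independent 750,000-membered library aliquots
example = [
    (750_000, {"ks01", "ks02"}),
    (750_000, {"ks03"}),
    (750_000, {"ks04", "ks05"}),
    (750_000, {"ks02", "ks06"}),
]
print(first_redundancy(example))
```

Dividing the final library size by the value returned for real screening data gives the kind of fold-excess figure quoted in the text for the expanded libraries.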

Natural Product Gene Cluster Identification and 
Recovery 

In excess of 35,000 unique microbial natural products have 
been characterized using culture-based methods. 50,51 This
amazing assortment of natural products is biosynthesized 
using a much smaller number of conserved enzyme families. 
The structural diversity seen in natural products appears to 
arise in large part from the natural combinatorial shuffling of 
these conserved biosynthetic enzyme families. 52 Degenerate 
primers designed to recognize conserved natural product biosynthetic
gene sequences should therefore be useful for identifying
eDNA-derived gene clusters that encode the biosynthesis
of a diverse collection of small molecules. In this study, three 
different sets of degenerate primers were used to recover three 
large natural product biosynthetic gene clusters from the Utah 
and California soil eDNA libraries. A cryptic Type II PKS gene 
cluster was identified using the Type II PKS-specific degenerate
primers we used in our initial library size analysis. 11,38 A
cryptic NRPS gene cluster was identified using degenerate primers
designed to amplify flavin-dependent halogenases known
to tailor aromatic amino acids found in halogenated nonribo- 
somal peptides. Degenerate primers designed to recognize 
acyl-CoA ligases found in lipopeptide antibiotic gene clusters 
were used to identify a gene cluster that is predicted to encode 
the known metabolite friulimicin. These three eDNA-derived 
gene clusters are referred to as the PKS, NRPS and FRI gene 
clusters, respectively. The PKS and NRPS gene clusters were 
found in an eDNA library derived from topsoil collected in 
Utah while the FRI gene cluster was found in an eDNA library 
derived from desert soil collected in California. 

FIGURE 2 Degenerate primers targeting minimal Type II PKS genes were used to identify KSβ
sequences present in unique eDNA library aliquots constructed from soil samples collected in Utah
(A) and California (B). ClustalW-derived 48 phylogenetic trees of the KSβ sequences identified in
these screens are shown. The aliquots from which sequences were amplified and the point at which
they began to reappear in the library (red) are shown as a heatmap.

Individual cosmid clones containing genes recognized by
the degenerate primers used in initial library screens were
recovered from the appropriate library and then end
sequenced (see Figure 3). PCR primers designed against the
end sequences were subsequently used to identify and recover
overlapping clones from the same library. The process of
clone recovery and end sequencing was iteratively repeated
until genes predicted to be involved in primary metabolism
were found on the distal ends of a recovered cosmid (see
Figure 3). This initial end-sequencing analysis suggested that
the NRPS and FRI gene clusters were recovered on three
cosmids each (NRPS: clones ZA41, Q87, J2; FRI: clones 1697,
1451, 201). The PKS gene cluster appeared to be present on
two overlapping cosmids (PKS: clones X16, V48).
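
The iterative recovery of overlapping clones can be pictured as extending a tiling path outward from a seed clone until the whole region of interest is covered. The sketch below illustrates that logic after the fact, once clone positions are known. All coordinates are hypothetical placeholders, since in practice the walk is driven by end-sequence PCR rather than by known positions; only the clone names are taken from the text.

```python
def tile_clones(seed, clones, target_start, target_end):
    """Greedily extend from a seed clone until the target region is covered.

    clones: dict mapping clone name -> (start, end) on a common coordinate.
    Raises ValueError if no overlapping clone can extend the path.
    """
    chosen = [seed]
    left, right = clones[seed]
    while right < target_end:
        options = [(e, n) for n, (s, e) in clones.items()
                   if n not in chosen and s < right < e]
        if not options:
            raise ValueError("no overlapping clone extends the right end")
        right, name = max(options)  # clone reaching furthest right
        chosen.append(name)
    while left > target_start:
        options = [(s, n) for n, (s, e) in clones.items()
                   if n not in chosen and s < left < e]
        if not options:
            raise ValueError("no overlapping clone extends the left end")
        left, name = min(options)  # clone reaching furthest left
        chosen.append(name)
    return sorted(chosen, key=lambda n: clones[n][0])


# Hypothetical coordinates (bp) for the three NRPS cosmids named in the text
nrps_clones = {"ZA41": (0, 38_000), "Q87": (30_000, 68_000), "J2": (60_000, 98_000)}
print(tile_clones("Q87", nrps_clones, 5_000, 90_000))
```

A failure to extend the path corresponds to the experimental situation in which no overlapping clone can be recovered from the library, which is why library size and coverage matter for this strategy.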

Each clone that was predicted to be part of a gene cluster
was fully sequenced and annotated. The eDNA-derived FRI
gene cluster and the friulimicin gene cluster from
A. friuliensis have the same gene organization and are 89%
identical over the 68-kb region that is predicted to comprise
the functional biosynthetic pathway. 53 A comparison of these
two gene clusters suggests that the entire FRI gene cluster was
likely captured on the three overlapping eDNA cosmids that
were recovered. While the eDNA-derived PKS and NRPS
gene clusters do not closely resemble any known gene
clusters, the appearance of primary metabolic enzymes in the
sequence surrounding the conserved natural product
biosynthetic genes found on these clones suggests they were
also likely recovered in their entirety. Sequencing of a fourth
overlapping clone that extends 20 kb beyond the NRPS gene
cluster found no enzymes associated with secondary metabolism.
As suggested by our initial eDNA library size analysis, cosmid
libraries containing in excess of 10 million clones appear to
provide sufficient coverage of soil metagenomes to allow
access to a diverse range of complete natural product
biosynthetic gene clusters.
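
The link between library size and coverage can be illustrated with
the classic Clarke-Carbon estimate, N = ln(1 - P) / ln(1 - f), where
N is the number of clones, P is the probability that a given locus
is captured at least once, and f is the ratio of insert size to
total (meta)genome size. The sketch below is not the library size
analysis referred to above; the 40-kb insert size, the 5-Mb genome
size, and the species counts are hypothetical values chosen only
for illustration, and all genomes are assumed to be equally
abundant.

```python
import math


def clones_needed(insert_bp, metagenome_bp, probability):
    """Clarke-Carbon estimate of the library size needed so that any given
    locus is present on at least one clone with the requested probability."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    fraction = insert_bp / metagenome_bp  # share of the metagenome on one clone
    if not 0.0 < fraction < 1.0:
        raise ValueError("insert must be smaller than the metagenome")
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


# Hypothetical inputs (assumptions, not values taken from the study)
INSERT_BP = 40_000
GENOME_BP = 5_000_000

for n_species in (1_000, 10_000):
    n = clones_needed(INSERT_BP, n_species * GENOME_BP, 0.95)
    print(f"{n_species} genomes -> ~{n:,} clones")
```

With these illustrative inputs, a community of 10,000 equally
abundant 5-Mb genomes would call for a library of a few million
clones. Real soil communities are far from evenly distributed, so
such numbers are only a rough guide.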

TAR Vector Design and Construction 

To facilitate TAR reassembly of large natural product gene
clusters as well as subsequent heterologous expression studies
with reassembled pathways, we created pTARa, a BAC-based
S. cerevisiae/E. coli/Streptomyces shuttle capture vector
(Figure 4a). This vector contains elements that allow pathways
to be assembled in S. cerevisiae, characterized and maintained
in E. coli, and conjugatively transferred into a wide range
of Streptomycetes for heterologous expression studies. 43 We
included these elements to facilitate Streptomyces-based
heterologous expression studies, but any number of
species-specific genetic elements can be incorporated into
pTARa to allow the transfer of pathways into a wide variety of
bacterial hosts. 23 As a demonstration of the utility of pTARa
as a shuttle vector, we propagated the vector in S. cerevisiae
(CRY1-2), transformed it into and reisolated it from E. coli,
and successfully conjugated it into a number of different
Streptomycetes, including S. toyocaensis, S. lividans, and
S. albus.

FIGURE 3 (A) PCR with degenerate primers was used to identify
biosynthetic genes of interest in large library pools (1) and then to
subsequently locate these same sequences in arrays of smaller library
aliquots (+). Whole cell PCR of serially diluted smaller library
aliquots was used to recover individual cosmids of interest (2).
Overlapping clones were iteratively recovered (3) until complete
biosynthetic pathways were identified (4). (B) The topology of the
overlapping clones that are predicted to comprise the eDNA-derived
PKS, NRPS, and FRI gene clusters is shown.

FIGURE 4 (A) pTARa contains elements that allow for the rapid
assembly and propagation of pathways in S. cerevisiae (green), the
transformation and analysis of these pathways in E. coli (blue), and
the integrative conjugation of assembled pathways into Streptomyces
(yellow). For capture vector construction, pathway-specific upstream
(US, blue) and downstream (DS, red) homology arms, as well as a
counter selection cassette (cyh/bla), are incorporated into the
capture vector. 23,24 During recombination, the counter selection
cassette is exchanged for a TAR cloned gene cluster (B) or TAR
reassembled eDNA pathway (C).

Capturing Natural Product Gene Clusters from
Sequenced Genomes Using pTARa

The cloning of natural product gene clusters from cultured
organisms traditionally requires the construction and
screening of a genomic DNA library. 4,M Using TAR cloning, a
sequenced biosynthetic gene cluster of any size can be directly
cloned without the need to construct or screen a genomic
library (Figure 5c). 23 To demonstrate the utility of pTARa for
culture-based natural products research, we cloned the
56-kb colibactin gene cluster directly from genomic DNA
isolated from the cultured bacterium C. koseri. 24,44 Previous
studies determined the functional boundaries of the colibactin
gene cluster via transposon mutagenesis. 46 To TAR clone
this gene cluster, we designed a pathway-specific capture
vector using this information (Figures 4b and 5) and
cotransformed the capture vector and C. koseri genomic
DNA into S. cerevisiae spheroplasts. 24,44 We screened yeast
spheroplasts using colibactin gene cluster-specific PCR primers
and were able to quickly identify clones containing intact
colibactin gene clusters. Detailed restriction mapping of the
TAR cloned pathway confirmed that we had specifically cloned
the colibactin gene cluster (pTARa-Colibactin) directly from
C. koseri genomic DNA (Figure 5b). 46 As demonstrated by
this experiment, TAR cloning should provide a general and
rapid means to access intact natural product biosynthetic gene
clusters from sequenced microorganisms without the need to
construct or screen a genomic library (Figure 5c).
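
The capture vector design step lends itself to a short sketch.
Given a sequenced region and the boundaries of the gene cluster,
the two homology arms are taken from the proximal and distal ends
of the region to be captured. The function name, the coordinates,
and the arm length used below are hypothetical placeholders; arm
lengths are not specified in this section.

```python
def homology_arms(genome, start, end, arm_len=1000):
    """Return (upstream, downstream) arms taken from the two ends of the
    target region genome[start:end] (0-based, end-exclusive).

    arm_len is an assumed value for illustration only."""
    if not 0 <= start < end <= len(genome):
        raise ValueError("target region must lie inside the sequence")
    if end - start < 2 * arm_len:
        raise ValueError("target region is too short for non-overlapping arms")
    upstream = genome[start:start + arm_len]    # proximal end of the cluster
    downstream = genome[end - arm_len:end]      # distal end of the cluster
    return upstream, downstream


# Toy example with a made-up sequence and a 4-bp arm length
toy = "AAAACCCCGGGGTTTTACGTACGT"
us, ds = homology_arms(toy, start=4, end=20, arm_len=4)
print(us, ds)
```

In practice the arms would be amplified or synthesized and placed
on either side of the counter selection cassette in the capture
vector, so that recombination at both arms captures the sequence
lying between them.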

FIGURE 5 (A) We used pTARa to directly and specifically clone
the colibactin gene cluster from C. koseri genomic DNA. Predicted
HindIII cut sites and restriction fragment sizes are marked on the
map of the pTARa-Colibactin construct. The size of the gene cluster
is listed. (B) The experimentally determined HindIII restriction map
of pTARa-Colibactin is shown. Two images of the same digest were
taken at different points during electrophoresis to highlight
fragment sizes more clearly (M = Lambda HindIII digest, M* = 50 bp
ladder). (C) TAR cloning of gene clusters circumvents the need to
construct and screen a genomic library.

TAR Assembly of Multiclone Gene Clusters

For each reassembly experiment, we constructed a unique
pathway-specific capture vector with homology arms corresponding
to sequences at the proximal and distal ends of the gene cluster
to be reassembled (Figures 4c and 6). Homologous recombination
in S. cerevisiae is stimulated by the presence of double-strand
breaks adjacent to recombination sites. 32 The individual
cosmids to be used in the reassembly of a gene cluster were
therefore linearized by restriction digestion with DraI and then
cotransformed with a linearized pathway-specific capture vector
into competent CRY1-2 S. cerevisiae. DraI, which recognizes the
AT-rich hexamer TTTAAA, digests the cosmid backbone yet
rarely cuts in the GC-rich sequences found in biosynthetic gene
clusters, thus providing a means to generate linear DNA
fragments for TAR reassembly reactions. The concentrations of
the components used in the cotransformation step were empirically
determined and selected to yield, on average, one assembled
construct per spheroplast. After 3-5 days of recovery on SC
uracil dropout agar, recovered spheroplasts were restreaked on
new SC uracil dropout agar plates. This step is necessary to
reduce the chance of cross contamination caused by DNA from the
TAR reaction during the PCR analysis that is used to identify
yeast colonies with assembled gene clusters. Yeast colonies were
then screened using multiplex PCR with primers specific to
each unique cosmid fragment predicted to be present in a
reassembled gene cluster construct. Between 30 and 70% of the
yeast colonies were found to be PCR positive for all fragments
predicted to be present in a pathway. Using this approach we
were able to rapidly identify yeast colonies that contained intact
biosynthetic gene clusters.
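
The rationale for linearizing cosmids with DraI can be checked
computationally before an experiment. The sketch below locates
TTTAAA sites in a sequence and estimates how often the site is
expected in random DNA of a given GC content. The estimate assumes
independent, strand-symmetric base usage, which is a simplification
that real genomes only approximate. The function names and the
example GC values are illustrative assumptions, not part of the
protocol described above.

```python
def find_drai_sites(seq):
    """Return the 0-based start position of every TTTAAA (DraI) site."""
    seq = seq.upper()
    site = "TTTAAA"
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i)
        i = seq.find(site, i + 1)  # continue searching after this hit
    return positions


def expected_site_spacing(gc_content):
    """Expected distance (bp) between TTTAAA sites in random DNA with the
    given GC fraction, assuming independent, strand-symmetric base usage."""
    if not 0.0 <= gc_content < 1.0:
        raise ValueError("gc_content must be in [0, 1)")
    p_at = (1.0 - gc_content) / 2.0  # probability of A (or of T) at one position
    return 1.0 / p_at ** 6           # all six positions must be A or T as required


print(find_drai_sites("GGTTTAAACCGCTTTAAAGG"))
for gc in (0.50, 0.70):
    print(f"GC {gc:.0%}: one site per ~{expected_site_spacing(gc):,.0f} bp")
```

Under this simple model a site is expected roughly every 4 kb at
50% GC but only about every 88 kb at 70% GC, which is consistent
with the observation that DraI rarely cuts within GC-rich
biosynthetic gene clusters. Running find_drai_sites on each insert
sequence is still worthwhile, because a single internal site
would split the fragment.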

FIGURE 6 (A) Experimentally determined restriction maps and predicted restriction enzyme cut sites
for each reconstructed gene cluster are shown. The size of each gene cluster is listed for clarity (M =
Lambda HindIII digest, M* = 50 bp ladder). (B) The overlapping cosmids (black) comprising a complete
biosynthetic pathway are shown above the region targeted for TAR assembly (green line). The individual
building blocks that are predicted to be used by the conserved modules (PKS and NRPS) found in these
biosynthetic pathways appear below each gene cluster (DHPG = dihydroxyphenylglycine). 42

Large constructs isolated from PCR positive yeast clones
were electroporated into E. coli and analyzed by detailed
restriction mapping (see Figure 6). In each case, the large
construct obtained from a TAR reassembly reaction produced a
restriction map that was identical to the map predicted to arise
from assembling the individual overlapping clones used in the
reaction (see Figure 6). The 39-kb PKS gene cluster was
successfully subcloned from the central region of cosmids X16 and
V48, two cosmids that contain 2.1 kb of overlap. The entire
89-kb cryptic NRPS gene cluster was successfully reconstructed in
a single S. cerevisiae spheroplast transformation reaction from
three overlapping eDNA cosmid clones. In a similar fashion,
we reassembled the 90-kb eDNA-derived FRI gene cluster using
a single S. cerevisiae spheroplast transformation reaction and
three overlapping eDNA-derived cosmid clones.

While the PKS and NRPS gene clusters were initially
assembled from fully sequenced sets of cosmids, reassembly
experiments can also be performed in the absence of
comprehensive sequencing. The FRI gene cluster was originally
reassembled with only end-sequencing data for each cosmid clone
predicted to comprise the complete gene cluster (Figures 4c
and 6). A capture vector based on the end-sequencing data
from the distal ends of the two outermost clones, cosmids 1679
and 201, was used to reassemble the gene cluster (see Figure 3).
We confirmed the successful reassembly of the fragments using
PCR and by comparing restriction maps of the reassembled
construct with those produced by the cosmids used in the
reassembly experiment (data not shown). Subsequent full
sequencing of the clones comprising the FRI gene cluster
confirmed the restriction mapping results and the success of the
sequencing-independent TAR assembly experiment (see Figure 6).
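
The comparison of predicted and observed restriction maps used
throughout this section can be sketched as follows. The first
function computes the fragment sizes expected when a circular
construct is cut at known positions, and the second checks whether
a set of observed band sizes matches the prediction within a
tolerance. The construct length, cut positions, band sizes, and
the 5% tolerance are made-up values for illustration; they are
not taken from Figures 5 or 6.

```python
def circular_digest(length, cut_positions):
    """Fragment sizes (bp, largest first) from cutting a circular molecule of
    the given length at the given 0-based positions."""
    cuts = sorted(set(p % length for p in cut_positions))
    if not cuts:
        return [length]  # uncut circle: report the full size
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    sizes.append(length - cuts[-1] + cuts[0])  # fragment spanning the origin
    return sorted(sizes, reverse=True)


def maps_agree(predicted, observed, tolerance=0.05):
    """True if both digests have the same number of bands and each observed
    band lies within the relative tolerance of its predicted size."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted, reverse=True), sorted(observed, reverse=True))
    return all(abs(p - o) <= tolerance * p for p, o in pairs)


# Hypothetical 50-kb construct with four cut sites
predicted = circular_digest(50_000, [1_000, 9_000, 21_000, 40_000])
print(predicted)
print(maps_agree(predicted, [19_500, 12_100, 10_900, 7_900]))
```

A band-count check of this kind cannot distinguish fragments of
nearly identical size, which is one reason a map is usually read
from more than one gel exposure or more than one enzyme.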

Traditional gene cluster assembly strategies can become
technically impractical when working with large naturally
derived DNA sequences. Unique and conveniently located
restriction sites needed for traditional "cut and paste"
strategies are often not available when working with long natural
DNA sequences. Recently, λ-based recombination has been
used to reconstruct functional gene clusters, circumventing
many of the problems associated with traditional strategies. 26
Lambda-based recombination becomes difficult, however, for
large gene clusters captured on multiple overlapping clones
because it requires the step-wise recombination of two clones
at a time. This step-wise recombination process requires the
introduction of a unique selectable marker into each fragment
used in an assembly experiment. As demonstrated here,
TAR-dependent assembly of multiclone natural product gene
clusters can be performed in a single reaction without any of
these barriers. The maximum number of DNA fragments that can
be simultaneously assembled in TAR experiments has yet to be
determined, but even the largest gene clusters are unlikely to
require more than three or four overlapping cosmids, which is
well within the established limits of TAR. 34,35,56

CONCLUSIONS 

Previous studies have demonstrated that metagenomic
strategies can be used to uncover metabolites encoded by gene
clusters captured on individual soil-derived eDNA clones (see
Figure 1). Cloning large natural product gene clusters presents
a challenge for both culture-dependent and culture-independent
studies. We have shown that TAR can be used to rapidly
reassemble overlapping eDNA-derived clones into single
constructs containing large eDNA-derived natural product gene
clusters. We have also shown that TAR can be used to directly
and specifically clone natural product gene clusters from
sequenced organisms without constructing and screening a
genomic library. TAR-dependent assembly of natural product
gene clusters from overlapping clones found in soil eDNA
libraries provides an experimental framework for rapidly
accessing intact natural product gene clusters that exceed
conventional eDNA cloning limits (Figure 1b). In doing so, it
eliminates one of the major roadblocks associated with current
metagenomic natural product discovery efforts. In this study,
this experimental approach provided access to both a new
example of what was thought to be a rare gene cluster (FRI) and
what appear to be new gene clusters (PKS, NRPS). The
heterologous expression of large TAR-assembled gene clusters
should form a basis for the identification of new natural
products from eDNA. The major remaining challenge to the
discovery of new natural products from uncultured bacteria, that
of heterologous expression, is not unique to culture-independent
studies and will likely need to be addressed using many
different gene cluster-specific strategies.

The authors thank Stephen Lory (Harvard Medical School) for [...]
(J. Craig Venter Institute) for experimental suggestions and the
core genomics facility at Memorial Sloan Kettering Cancer Center [...]